QTM 447 Lecture 17: Seq2Seq and Attention

Kevin McAlister

March 18, 2025

Recurrent Neural Networks

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Recurrent Neural Networks (RNNs) are the backbone of sequence modeling

  • Inputs or outputs make sense temporally

  • Knowing what came before or what comes after should inform what we predict at a current point in time

Adds knowledge to the NN architecture

  • Units need to talk backwards in time - all of the previous values need to influence what I guess next!

  • Note, we’re going to use time pretty loosely here - just some sort of meaningful temporal flow

If the order of the input matters, then we should preserve that information!

Recurrent Neural Networks

We can process a sequence of vectors, \(\mathbf x\), by applying a recurrence formula at every time step:

\[ h_t = f_w(h_{t - 1}, x_t) \]

  • \(h_{t - 1}\) is the old state (some vector of numbers)

  • \(x_t\) is the input state at time \(t\) (some vector)

  • \(f_w()\) is a function with some parameters, \(\mathbf W\) (some collection of weight matrices)

  • \(h_t\) is the new state (some vector)

Recurrent Neural Networks

The hidden state vector contains latent information about the current state of the model

  • For language models, it could tell us about the gender of pronouns in the sentence

  • Or the passiveness of the tone

This information should persist throughout the sequence of predictions!

  • Create memory from one prediction to the next.

Recurrent Neural Networks

The vanilla update:

\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]

  • \(\mathbf h_t\) and \(\mathbf h_{t-1}\) are hidden vectors of size \(m\)

  • \(\mathbf W_{hh}\) is a \(m \times m\) matrix of weights that controls the mapping of one hidden state vector to the next

  • \(\mathbf W_{xh}\) is a \(m \times P\) matrix of weights that controls the mapping of the input to the hidden state

  • \(\mathbf b_h\) is a vector of bias terms

  • \(\text{tanh}()\) is an activation function
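
The vanilla update above can be sketched in a few lines of numpy. This is a minimal illustration, not a trainable model: the sizes are tiny and the weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, P = 4, 3  # hidden size m and input size P (small, illustrative)

# Randomly initialized weights; in practice these are learned.
W_hh = rng.normal(size=(m, m)) * 0.1  # hidden-to-hidden mapping
W_xh = rng.normal(size=(m, P)) * 0.1  # input-to-hidden mapping
b_h = np.zeros(m)                     # bias terms

def rnn_step(h_prev, x_t):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Apply the recurrence over a short sequence of input vectors.
h = np.zeros(m)  # initial hidden state
for x_t in rng.normal(size=(5, P)):
    h = rnn_step(h, x_t)

print(h.shape)  # the hidden state stays size m at every step
```

Note that every step reuses the same `W_hh`, `W_xh`, and `b_h` - the weights are shared across time, which is what makes the model recurrent rather than just deep.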

Recurrent Neural Networks

The RNN is said to have infinite memory

  • What it saw before will always play a part (albeit with diminishing influence) in what it predicts next!

  • Recurrence!

Seq2Seq

Today, we’re going to look at the most popular RNN framework - Seq2Seq

Seq2Seq

Any sequence of inputs to any sequence of outputs

  • Machine translation (French to English)

  • Character Addition

  • General Question and Answer!

Seq2Seq is a generic approach to the sequence-to-sequence prediction problem!

Seq2Seq


For each output, we look at:

  • The previous decoder state

  • The previous output

  • The final encoder state

Problem: The input sequence is bottlenecked through the final encoder state

  • Say, a length-512 vector that describes all of the language we previously saw

  • What if the length of the input was 1000?

  • Consider the aligned case - we \(\rightarrow\) estamos

    • By the end of the input sequence, the final encoder state is likely only loosely related to that early input!

Seq2Seq

For large language models, seq2seq would need to summarize all of the inputs that it saw in the training stage in terms of a single vector of length 512

  • Large for small problems, but quite small for the entirety of the English language

  • Try to imagine all of the books you’ve ever read.

  • Imagine all of that literature compressed into a single length-512 vector of numerical representations

  • Not really a good strategy…

Seq2Seq with Attention

Attention is a mechanism that allows the decoder to leverage all parts of the original inputs

  • Really the corresponding hidden states

At different steps of the decoder, let the model “focus” on different parts of the input!

  • The decoder still looks at its previous hidden state, but can also look at all of the hidden states produced in the encoder step

Seq2Seq with Attention


Input: “The agreement on the European Economic Area was signed in August 1992”

Output: “L’accord sur la zone economique europeenne a ete signe en aout 1992.”


Seq2Seq with Attention

The decoder doesn’t use the fact that the encoder hidden states form an ordered sequence!

  • It just treats them as an unordered set and looks for similarities

Order is baked into the hidden states via the encoder

  • No need to double bake in order!

Attention Scores

How do we compute the attention weights?

\[ e_{ti} = f(s_{t-1}, h_i) \]

\[ a_{ti} = \sigma(e_{ti}) \]

where \(\sigma()\) is the softmax, normalized over the encoder positions \(i\)

  • The attention scores should measure the level of similarity between the current context of the decoder and the original input states

  • Give more weight if the current context should leverage a specific part of the input

  • At time \(t\) of the decoder, create a weight for each state of the encoder

Attention Scores

A quick note:

Without loss of generality, we can assume that the hidden state vectors of the encoder \(\mathbf h_i\) and the decoder \(\mathbf s_t\) are normalized to have norm 1

  • These are just vectors, so we can always force them to have a specific size

  • No change in direction, just change in norm

Additionally, let's just assume that each \(\mathbf h_i\) and \(\mathbf s_t\) have the same length!

  • Need not be, but much easier computationally

Attention Scores

Goal: Find the key that is most similar to the query

Method 1: Vanilla MLP (Bahdanau attention)

\[ e_{ti} = \mathbf w_2^T \text{tanh}\left(\mathbf W_1 [\mathbf h_i , \mathbf s_{t - 1}] \right) \]

  • Find similarity as a prediction problem and use a 1 layer NN to get the values

  • Concatenate the encoder and decoder hidden states and learn a weight mapping

  • Softmax over all hidden states to get weights

  • Requires learning a weight matrix \(\mathbf W_1\) and a weight vector \(\mathbf w_2\)

  • Generalizes to case where the encoder and decoder hidden states are of different lengths
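
A minimal numpy sketch of the Bahdanau score, assuming a shared hidden size \(m\) for encoder and decoder; the MLP hidden size `d` is an arbitrary modeling choice, and the random weights stand in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4  # shared encoder/decoder hidden size (illustrative)
d = 8  # hidden size of the attention MLP (a modeling choice)

W_1 = rng.normal(size=(d, 2 * m)) * 0.1  # maps concatenated [h_i, s_{t-1}]
w_2 = rng.normal(size=(d,)) * 0.1        # projects the MLP output to a scalar

def bahdanau_score(h_i, s_prev):
    """e_ti = w_2^T tanh(W_1 [h_i, s_{t-1}]) for one encoder state."""
    return w_2 @ np.tanh(W_1 @ np.concatenate([h_i, s_prev]))

H = rng.normal(size=(6, m))     # six encoder hidden states
s_prev = rng.normal(size=(m,))  # previous decoder state s_{t-1}

# Score every encoder state, then softmax to get attention weights.
e = np.array([bahdanau_score(h_i, s_prev) for h_i in H])
a = np.exp(e) / np.exp(e).sum()
print(a)  # six nonnegative weights that sum to 1
```

Because `W_1` just acts on the concatenation, nothing here actually requires `h_i` and `s_{t-1}` to have the same length - that is the generalization mentioned in the last bullet.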

Attention Scores

Goal: Find the key that is most similar to the query

\[ \mathbf h_i \in \mathbb R^m \text{ ; } \mathbf s_{t-1} \in \mathbb R^m \]

Method 2: Learnable Scaled Dot Product

\[ e_{ti} = \frac{\mathbf h_i^T \mathbf W_g \mathbf s_{t-1}}{\sqrt{m}} \]

  • Find the inner product of the two vectors and divide by \(\sqrt{m}\) to keep the scale of the scores stable

  • Learn a \(m \times m\) matrix that controls the mapping!

Attention Scores

Putting this all together, we get a model for each output of the decoder!

  • Place weights where needed to map from one set of attention weighted encoder to a decoder output

  • Note that the decoder output is a weighted combination of the previous decoder state and the attention weighted encoder state

  • A recurrent weighted scheme for producing contextual seq2seq models

  • Fully differentiable, so not too much to cover in terms of fitting

  • Just let autodiff handle it!

Attention Scores

\[ \mathbf y_{t} = f(\mathbf s_t) \text{ ; } \mathbf s_t = f(\mathbf s_{t-1}, \mathbf y_{t-1}, \mathbf c_t) \]

\[ \mathbf c_t = \sum \limits_{i = 1}^{T_e} a_{ti} \mathbf h_i \text{ ; } a_{ti} = \sigma(e_{ti}) \]

\[ e_{ti} = \frac{\mathbf h_i^T \mathbf W_g \mathbf s_{t-1}}{\sqrt{m}} \]

where \(\mathbf h_i\) is learned via an RNN setup!
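
One decoder step of this scheme can be sketched in numpy - scores, weights, and context vector, with the recurrence \(f\) left abstract. Sizes and random weights are illustrative stand-ins for learned quantities.

```python
import numpy as np

rng = np.random.default_rng(3)
m, T_e = 4, 6  # hidden size and encoder sequence length (illustrative)

H = rng.normal(size=(T_e, m))        # encoder hidden states h_1 .. h_{T_e}
s_prev = rng.normal(size=(m,))       # previous decoder state s_{t-1}
W_g = rng.normal(size=(m, m)) * 0.1  # learned score matrix (random here)

# Scaled dot-product scores: e_ti = h_i^T W_g s_{t-1} / sqrt(m)
e = (H @ W_g @ s_prev) / np.sqrt(m)  # shape (T_e,)

# Softmax over encoder positions -> attention weights a_ti
a = np.exp(e - e.max())
a = a / a.sum()

# Context vector: c_t = sum_i a_ti h_i, a weighted mix of encoder states
c_t = a @ H  # shape (m,)
print(c_t.shape, a.sum())
```

The context vector `c_t` would then feed into the decoder recurrence alongside the previous output, and autodiff handles the gradients through every step shown here.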

Prediction with Attention


Attention is All You Need

RNNs are sequential models by nature

  • Look at the first word, then the first two, then the first three…

Do we even need to do it that way?

Attention is All You Need

Suppose that we have an input sentence:

I arrived at the bank after crossing the street.

At \(t = 8\) of the encoder, we see:

I arrived at the bank after crossing the []

Two completely viable sentences:

  • Street

  • River

Attention is All You Need

There is no hope for a sequential model to learn what should come next!

  • There is no context in the lead up to the next word

What we need is the ability to look ahead when we encode the input

  • And then correct the prediction afterwards to account for the fact that we cheated

Query/Key/Value

Attention Layers will require us to change our terminology a little

  • Query: vector from which the attention is looking. In the seq2seq model, this is the decoder hidden state \(\mathbf s_{t-1}\)

  • Key: vector at which the query looks to compute weights. \(\mathbf h\) - all the hidden states in the encoder

  • Value: the vector to be weighted by the similarity between the query and the key - also the hidden states in the encoder

Attention Layer

What if we didn’t need to even encode ordering?

  • Treat everything as a QKV problem

  • Given a query, compare it to all keys

  • Compute an attention weight

  • Use that to get the context and then make a prediction

Broader attention definition than seq2seq

  • Image to sequence?

Attention Layer

Inputs:

  • Queries: \(\mathbf Q\) - a \(N_Q \times m\) matrix (\(N_Q\) is the number of decoder positions that need a query)

  • Input vectors: \(\mathbf X\) - a \(N_X \times D_X\) matrix of input values where each row corresponds to one element of the input sequence (think a word embedding for each word in a sentence)

  • Key weights: \(\mathbf W_K\) - a \(D_X \times m\) matrix of weights, so that the keys can be compared to the queries

  • Value weights: \(\mathbf W_V\) - a \(D_X \times D_V\) matrix where \(D_V\) is the length of the output vectors (again, think of the output vectors as embedded versions of words)

Attention Layer

Computations:

  • Key vectors: \(\mathbf K = \mathbf X \mathbf W_K\), a \((N_X \times m)\) matrix

  • Value vectors: \(\mathbf V = \mathbf X \mathbf W_V\), a \((N_X \times D_V)\) matrix

  • Similarities: \(\mathbf E = \frac{\mathbf Q \mathbf K^T}{\sqrt{m}}\), a \((N_Q \times N_X)\) matrix

  • Attention weights: \(\mathbf A = \sigma (\mathbf E)\) - normalized over the rows.

  • Output vectors: \(\mathbf Y = \mathbf A \mathbf V\) a \((N_Q \times D_V)\) matrix
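
The whole layer is a handful of matrix products. A minimal numpy sketch, with small illustrative sizes and random matrices standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(4)
N_Q, N_X, D_X, m, D_V = 2, 5, 3, 4, 6  # small illustrative sizes

Q = rng.normal(size=(N_Q, m))            # queries
X = rng.normal(size=(N_X, D_X))          # input vectors
W_K = rng.normal(size=(D_X, m)) * 0.1    # key weights
W_V = rng.normal(size=(D_X, D_V)) * 0.1  # value weights

K = X @ W_K                # key vectors, (N_X x m)
V = X @ W_V                # value vectors, (N_X x D_V)
E = (Q @ K.T) / np.sqrt(m) # similarities, (N_Q x N_X)

# Row-wise softmax: each query's weights over the N_X inputs sum to 1
A = np.exp(E - E.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)

Y = A @ V                  # output vectors, (N_Q x D_V)
print(Y.shape)
```

Notice there is no loop over time anywhere - every query attends to every input in one batched computation, which is exactly why this layer does not respect order on its own.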

Attention Layer


Everything is differentiable

  • So just train the model with autodiff!

This model is close to the seq2seq model, but it doesn’t respect order

Attention Layer

An in between model:

  • Encoder that only uses attention (more or less a complicated flat neural network)

  • Decoder that is sequential that references the encoder via attention

Attention Layer

This model doesn’t quite outperform the seq2seq model yet.

  • Need to better add context and order to the encoder, but do it in a way that doesn’t require sequential processing…

For that, we’ll need to add three things (next time):

  • Self-attention

  • Masked self-attention

  • Multiheaded self-attention